Columns in the delhi dataset: X, price, Address, area, latitude, longitude, Bedrooms, Bathrooms, Balcony, Status, neworold, parking, Furnished_status, Lift, Landmarks, type_of_building, desc, Price_sqft
Prediction on the rent of flats and houses in Delhi
In this analysis, I explore the delhi dataframe to gain insights into housing prices in Delhi. Subsequently, I aim to develop predictive models for estimating the rent of both flats and houses in the city.
1 Data Overview and Preparation
There are many interesting variables in the dataset, including details on housing prices, area, location, and amenities such as bedrooms, bathrooms, balcony, lift, and parking. It also provides information on property features such as new/resale status and furnished/unfurnished condition. Leveraging these variables, we can effectively discern and predict the variations in price ranges across different properties in Delhi.
2 Pre-Processing
2.1 Units
Code
# 1. Units
delhi$area_sqm <- delhi$area * 0.0929 # convert square feet to square meters
conversion_factor <- 0.011 # approximate INR-to-EUR exchange rate
delhi$price_eur <- delhi$price * conversion_factor
delhi <- delhi %>%
mutate(price_per_sqm = price_eur / area_sqm)

2.2 NAs
Code
# 2. NAs
missing_values <- colSums(is.na(delhi))
delhi <- delhi %>%
mutate(across(where(is.numeric), ~if_else(is.na(.), 0, .)))
#sapply(delhi, function(x) is.numeric(x) && any(x == 0)) # to check which variables contain 0
#colSums(is.na(delhi))

Pros and Cons of Replacing Missing Values with 0
Pros:
It helps in maintaining the structure of the dataset.
It allows calculations to be performed on the variables without encountering NA issues.
Cons:
It might introduce bias, especially if missing values were not truly zero.
It assumes that missing values mean absence rather than unknown or undefined.
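Where zero is implausible (a flat with zero bathrooms, say), median imputation is a common alternative that preserves each column's center. A minimal sketch on a made-up data frame (the values below are illustrative, not from the delhi data):

```r
library(dplyr)

# Toy stand-in for the delhi data; the NAs and values are made up
toy <- data.frame(parking = c(1, 2, NA, 4), Balcony = c(NA, 1, 1, 3))

# Replace NAs in numeric columns with the column median instead of 0
toy_imputed <- toy %>%
  mutate(across(
    where(is.numeric),
    ~ if_else(is.na(.), median(., na.rm = TRUE), .)
  ))

toy_imputed$parking  # 1 2 2 4: the NA becomes the median (2), not 0
```

Which choice is appropriate depends on whether a missing value really means "absent" or merely "unknown".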
2.3 Log-Transformation
Code
#This line selects numeric columns from the delhi dataset excluding the longitude and latitude columns. The result is a vector of column names (cols_no_lonlat) for numeric variables that will undergo log transformation.
cols_no_lonlat <- delhi |>
select(where(is.numeric), -c(longitude, latitude)) |>
names()
delhi <- delhi |>
mutate(
across(
where(~is.numeric(.x) && min(.x) == 0),
~.x + 1)) |> #the code uses the mutate(across()) function to add 1 to all numeric columns in the dataset (delhi) where the minimum value is 0. This is done to avoid issues with log transformation when the value is 0.
mutate(
across(
all_of(cols_no_lonlat),
~log(.x),
.names = "{.col}_log"
)
) # apply the log transformation to all numeric columns in the cols_no_lonlat vector (excluding longitude and latitude); the results are stored in new columns suffixed with "_log"

2.4 Splitting the Data
Code
set.seed(123) # for reproducibility
train_index <- sample(1:nrow(delhi), 0.7 * nrow(delhi))
# Create training and testing datasets
delhi_train <- delhi[train_index, ]
delhi_test <- delhi[-train_index, ]

Here I split the data 70/30: 70% for training and the rest for testing. An 80% training share would give the model more information to learn from, but would leave less data for testing and validation; a smaller training set, on the other hand, may lead to underfitting. A 70% training share is therefore a reasonable compromise.
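A quick way to confirm the split behaves as intended is to check that the index vector has no duplicates and covers 70% of the rows. A sketch on a stand-in row count (the real dataset has 7738 rows):

```r
set.seed(123)  # same seed as above, for reproducibility
n <- 1000      # stand-in for nrow(delhi)
train_index <- sample(1:n, 0.7 * n)

length(train_index)                # 700 training rows
length(setdiff(1:n, train_index))  # 300 testing rows
anyDuplicated(train_index)         # 0: sample() draws without replacement
```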
3 Exploratory Analysis
3.1 Mapping Space and Price per Square Meter
3.1.1 Original and Log-Transformed Price per Square Meter of Flats and Houses in Delhi
In this map, each data point is colored by its original price per square meter: darker red represents higher prices and dark blue lower prices. The map gives a direct visualization of the spatial distribution of housing prices in Delhi and lets us quickly identify areas with higher or lower average prices. As the map above shows, the cheaper prices per square meter lie around the outer areas, while the more expensive properties cluster close to the city center of Delhi.
In this map, colors are assigned based on the log-transformed price per square meter. Darker red and blue colors may still represent higher and lower prices, respectively, but now the scale is logarithmic. I used log transformation to handle skewed data and make it easier to visualize the relative differences in lower price ranges.
Generally, the colors of the second map may be more informative, enhancing the visibility of the price ranges and highlighting relative differences across a wide range of prices in Delhi. However, I would still prefer the original price per square meter, because each data point is easier to interpret on the original scale and it allows us to quickly identify areas with higher and lower average prices.
3.1.2 Original and Log-Transformed Area in Square Meters of Flats and Houses in Delhi
In this map, prominent clusters of dark blue colors across the city of Delhi indicate areas with generally smaller absolute sizes of flats and houses. However, the presence of some green and dark red data points signifies specific areas characterized by particularly spacious flats and houses, offering a nuanced view of the diverse housing sizes within the city.
Conversely, using the log-transformed area accentuates regions where the distribution of the original areas was skewed. This transformation spreads out the values, providing a more insightful visualization and enhancing our ability to perceive relative variations in flat and house sizes across the city. As the map above shows, properties in the north of Delhi tend to be smaller, while larger apartments and houses are more common in the south of the city.
3.2 Categories and Price
When contemplating a property purchase, it is crucial to weigh various factors, including the choice between a flat or an individual house, opting for a new construction or a resale property, deciding on furnished or unfurnished spaces, and determining whether to go for a ready-to-move unit or one under construction.
To analyze and illustrate the price disparities associated with these factors, I created a series of boxplots using the log-transformed price per square meter (price_per_sqm_log). The price-per-square-meter approach standardizes the comparison by relating price to property size, which helps assess cost efficiency in terms of the space obtained for the price. Moreover, since the original price per square meter has a skewed distribution, I used a log transformation to better visualize the relative differences in price ranges.
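The boxplots themselves are not reproduced here; a minimal ggplot2 sketch of one of them, on simulated data (the group labels and values below are illustrative, not the delhi columns), could look like:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for delhi: two building types, log price per sqm
sim <- data.frame(
  type_of_building = rep(c("Flat", "Individual House"), each = 50),
  price_per_sqm_log = c(rnorm(50, mean = 7.0, sd = 0.4),
                        rnorm(50, mean = 7.4, sd = 0.4))
)

# One boxplot per category of the grouping variable
p <- ggplot(sim, aes(x = type_of_building, y = price_per_sqm_log)) +
  geom_boxplot() +
  labs(x = NULL, y = "Log price per square meter") +
  theme_light()
```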
3.2.1 Individual Houses and Flats
3.2.2 New or Resales Properties
3.2.3 Furnished or Unfurnished
3.2.4 Ready or Under-Construction Properties
According to the series of boxplots shown above, the most money-saving strategy is to buy a flat or house that is a resale, unfurnished, and ready to move into.
3.3 Size and Price
3.3.1 Size and Price with Parking Availability
At this stage, I aim to explore the relationship between apartment size and total price, taking into account the influence of parking availability. I will depict this connection using two sets of variables: the original area and price data, as well as the log-transformed counterparts. By comparing these two visualizations, I intend to illustrate the differences and assess which representation provides more informative insights into the impact of parking availability on the size and price dynamics of apartments.
As the graphs of apartment size against total price (with parking and balcony availability) show, the main difference is that the data points are more spread out with the log-transformed data than with the original data. However, I find the first approach, using the original area and the original total price, more informative, because we can interpret the data points in terms of our real-world perception of high and low prices.
3.3.2 Size and Price with Balcony Availability
4 Preliminaries and Hypothesis Testing
4.1 Splitting the Data and Setting the Algorithm
Before moving on to hypothesis testing, I will conduct a preliminary step: splitting the dataset into 60% for training and the rest for testing.
Code
set.seed(0421)
data_split <- initial_split(delhi, prop = 0.6)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
Regression_OLS <- linear_reg() |>
set_mode("regression") |> # Machine learning: regression or classification
set_engine("lm")

4.2 Two Variables and Two Hypotheses
4.2.1 An Overview of the Correlations in the Data
As a first step, I create a correlation matrix to get an overview of the correlations in the data and to identify which variables are highly correlated with the total price. As the output shows, the total price is highly correlated with area_sqm, Bathrooms, and Bedrooms.
4.2.2 Two Predictor Variables Selection and Two Hypotheses
Based on the exploratory analysis, I select two variables, area_sqm (area in square meters) and parking, which I expect to have a substantial influence on housing prices in Delhi.
Next, I formulate two hypotheses about these variables as follows:
Hypothesis 1:
Null Hypothesis (H0): There is no significant relationship between the size of the property (area_sqm) and housing prices.
Alternative Hypothesis (H1): There is a significant positive/negative relationship between the size of the property (area_sqm) and housing prices.
Explanation: If the null hypothesis is rejected, it suggests that the size of the property has a significant impact on housing prices. The direction of the relationship (positive/negative) will indicate whether larger properties tend to have higher or lower prices.
Hypothesis 2:
Null Hypothesis (H0): There is no significant relationship between parking and housing prices.
Alternative Hypothesis (H1): There is a significant relationship between parking and housing prices.
Explanation: If the null hypothesis is rejected, it implies that parking has a significant effect on housing prices. This could mean that an increase or decrease in the number of parking spaces is associated with a corresponding increase or decrease in housing prices.
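Both hypotheses amount to t-tests on the slope coefficients of an OLS fit. As a sketch of the decision rule, here is a simulated example (not the delhi data) in which the price truly depends on area but not on parking:

```r
set.seed(1)
# Simulated data: price depends on area; parking has no true effect
area_sqm <- runif(200, 30, 300)
parking  <- sample(0:3, 200, replace = TRUE)
price    <- 1000 * area_sqm + rnorm(200, sd = 5000)

fit <- lm(price ~ area_sqm + parking)
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]

pvals["area_sqm"] < 0.05  # TRUE: reject H0, area is significant
pvals["parking"]          # most likely above 0.05: fail to reject H0
```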
4.3 Pre-processing steps
4.3.1 Identify Missing Values
In the original delhi dataset, two variables, Balcony and parking, contain missing values; I already identified and handled these in a previous step. Still, as a first pre-processing step before developing predictive models, it is necessary to check whether the response variable price_eur or the two predictor variables area_sqm and parking contain any missing values, because missing values in these variables can degrade model performance and lead to biased or incomplete analysis.
Code
# 1. check if these response/predictor variables have any missing values
summary(is.na(delhi$price_eur))
   Mode   FALSE
logical    7738
Code
summary(is.na(delhi$area_sqm))
   Mode   FALSE
logical    7738
Code
summary(is.na(delhi$parking))
   Mode   FALSE
logical    7738
4.3.2 Log-Transformation for Skewed Data
Another pre-processing step is a log transformation, because I noticed that the response variable price_eur and the predictor variable area_sqm have skewed distributions, which can violate model assumptions and hurt performance. To mitigate the impact of extreme values and make the distributions more symmetrical, I log-transformed these variables, which should improve the performance of models that benefit from more balanced data.
Code
# 2. check if these response/predictor variables have normal or skewed distribution
hist(delhi$price_eur) # skewed distribution
Code
hist(delhi$price_eur_log) # normal distribution
Code
hist(delhi$area_sqm) # skewed distribution
Code
hist(delhi$area_sqm_log) # normal distribution
Code
suppressWarnings({
train_data %>%
ggplot(aes(x = area_sqm, y = price_eur)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
theme_light()
})
Code
suppressWarnings({
train_data %>%
ggplot(aes(x = area_sqm_log, y = price_eur_log)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
theme_light()
})

As mentioned, addressing missing values and skewed distributions in the pre-processing step is crucial for building robust and accurate predictive models.
4.4 Fitting the Model (Intercept-Only Model)
The next step is to fit the model on the training data to estimate its parameters, such as the intercept and the coefficients for each predictor variable.
Code
TM0 <- fit(Regression_OLS, price_eur ~ 1, data = train_data)
#glance(TM0)
rmse(as.data.frame(cbind(train_data$price_eur, TM0$fit$fitted.values)), V1, V2)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 79811.
As you can see, the RMSE value of 79811.5 is the square root of the average squared difference between the actual prices (price_eur) and the prices predicted by the model. Roughly speaking, the model's predictions are off by approximately 79811.5 units of the currency on a typical observation.
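For an intercept-only model the fitted value is simply the training mean of the response, so the RMSE reduces to the (population) standard deviation of the response. A small numeric check on made-up values:

```r
y <- c(50, 70, 90, 110, 130)      # toy response values (not the delhi prices)
y_hat <- rep(mean(y), length(y))  # intercept-only prediction: the mean

rmse_manual <- sqrt(mean((y - y_hat)^2))
rmse_manual  # sqrt(800) ~ 28.28, the population SD of y
```

This is why the baseline RMSE is useful: any model with predictors should beat it.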
4.5 Model Development
At this stage, I selected two variables, area_sqm and parking, as predictors that I think may influence housing prices. I then developed several predictive models to see which performs best for predicting housing prices in Delhi.
4.5.1 Model 1: TM1 Model
Code
TM1 <- fit(Regression_OLS, price_eur ~ area_sqm, data = train_data)
tidy(TM1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -41649. 1356. -30.7 9.37e-189
2 area_sqm 1014. 9.16 111. 0
Code
glance(TM1)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.725 0.725 41828. 12260. 0 1 -55983. 111971. 111991.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Model 1 (TM1):
- Coefficients: The coefficient for area_sqm is 1014.41. This suggests that for each additional square meter of area, the predicted price_eur increases by 1014.41 euros.
- Standard Errors: The standard error for area_sqm is 9.16.
- Significance: Both the intercept and area_sqm have very small p-values (close to zero), indicating that they are statistically significant.
- Goodness of Fit: The R-squared value is 0.7255, meaning that approximately 72.55% of the variability in price_eur is explained by the model. This relatively high value suggests that the model explains a substantial portion of the variability in price_eur.
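To read the TM1 coefficients concretely: a point prediction is the intercept plus the slope times the area. Using the rounded estimates reported above:

```r
# Rounded TM1 estimates from tidy(TM1) above
intercept <- -41649  # euros
slope     <- 1014    # euros per additional square meter

pred_100sqm <- intercept + slope * 100
pred_100sqm  # 59751: predicted price in euros for a 100 sqm property
```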
4.5.2 Model 2: TM2 Model
Code
TM2 <- fit(Regression_OLS, price_eur ~ parking, data = train_data)
tidy(TM2)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 92355. 1178. 78.4 0
2 parking -33.8 31.1 -1.09 0.277
Code
glance(TM2)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.000255 0.0000391 79819. 1.18 0.277 1 -58982. 117971. 117990.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Model 2 (TM2):
- Coefficients: The coefficient for parking is -33.77. This suggests that, holding other variables constant, each unit increase in parking is associated with a decrease in the predicted price_eur of 33.77 euros.
- Standard Errors: The standard error for parking is 31.07.
- Significance: The p-value for parking is 0.2772, larger than the typical significance level of 0.05. This suggests that parking may not be a statistically significant predictor of price_eur.
- Goodness of Fit: The R-squared value is 0.0002545, indicating that the model explains a very small proportion of the variability in price_eur. In other words, the predictor parking contributes little to explaining the variation in the response, and the model has very low explanatory power.
4.5.3 Model 3: TM3 Model
In real estate, the number of bedrooms and bathrooms are fundamental features that influence housing prices. On the other hand, the presence of a balcony is an additional feature that can impact the overall price of a property. Moreover, based on the correlation matrix, Bedrooms and Bathrooms are highly correlated with price_eur (total price), indicating a strong relationship.
As mentioned, I selected Bathrooms, Bedrooms and Balcony as control variables and added them to the following models to see whether these models perform better than the previous ones.
Code
TM3_Model <- recipe(
price_eur ~ area_sqm + Bathrooms + Bedrooms + Balcony,
data = train_data
)
summary(TM3_Model)
# A tibble: 5 × 4
variable type role source
<chr> <list> <chr> <chr>
1 area_sqm <chr [2]> predictor original
2 Bathrooms <chr [2]> predictor original
3 Bedrooms <chr [2]> predictor original
4 Balcony <chr [2]> predictor original
5 price_eur <chr [2]> outcome original
Code
TM3 <- fit(Regression_OLS, price_eur ~ area_sqm + Bathrooms + Bedrooms + Balcony, data = train_data)
tidy(TM3)
# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -50897. 2308. -22.0 1.51e-102
2 area_sqm 921. 15.4 59.6 0
3 Bathrooms 10238. 1280. 8.00 1.62e- 15
4 Bedrooms -578. 1111. -0.520 6.03e- 1
5 Balcony -939. 433. -2.17 3.02e- 2
Code
glance(TM3)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.730 0.730 41484. 3136. 0 4 -55943. 111898. 111937.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Model 3 (TM3):
- Coefficients:
  - The coefficient for area_sqm is 920.56. For each additional square meter of area, the estimated price increases by 920.56 euros.
  - The coefficient for Bathrooms is 10,237.59. Each additional bathroom is associated with an increase in price of 10,237.59 euros.
  - The coefficient for Bedrooms is -578.10. Each additional bedroom is associated with a decrease in price of 578.10 euros. This result might seem strange, and I need to consider interactions with other variables.
  - The coefficient for Balcony is -939.48. Each additional balcony is associated with a decrease in price of 939.48 euros.
- Standard Errors: all the standard errors seem reasonable, suggesting that the coefficient estimates are likely reliable.
  - The standard error for area_sqm is 15.43.
  - The standard error for Bathrooms is 1280.45.
  - The standard error for Bedrooms is 1111.33.
  - The standard error for Balcony is 433.44.
- Significance:
  - area_sqm and Bathrooms have very low p-values (close to zero), indicating that they are statistically significant predictors of the price.
  - Bedrooms has a p-value greater than 0.05, suggesting it might not be a statistically significant predictor. Balcony, with a p-value of 0.03, is statistically significant at the 0.05 level.
- Goodness of Fit: The R-squared value of 0.7301 indicates that the model explains approximately 73.01% of the variability in the response variable (price). This suggests a reasonably good fit.
In summary, this model suggests that area_sqm and Bathrooms are strong predictors of the housing price, while Bedrooms and Balcony may have less impact.
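The counter-intuitive negative Bedrooms coefficient is consistent with correlated predictors: a variable can correlate positively with price on its own yet carry a negative partial coefficient once area is held fixed. A constructed toy illustration (not the delhi data):

```r
# Bedrooms track area, but the partial effect of an extra bedroom
# at a FIXED area is negative by construction
area     <- seq(40, 240, by = 2)
bedrooms <- round(area / 50) + 1
price    <- 1000 * area - 2000 * bedrooms  # exact relation, no noise

fit <- lm(price ~ area + bedrooms)

cor(price, bedrooms) > 0  # TRUE: marginally, more bedrooms = higher price
coef(fit)[["bedrooms"]]   # -2000: the negative partial effect is recovered
```

So the sign of a multiple-regression coefficient answers "what changes at fixed area", not "are pricier homes bigger".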
4.5.4 Model 4: TM4 Model
Code
TM4 <- fit(Regression_OLS, price_eur ~ parking + Bathrooms + Bedrooms + Balcony, data = train_data)
tidy(TM4)
# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -91536. 2933. -31.2 3.25e-194
2 parking -7.73 21.5 -0.360 7.19e- 1
3 Bathrooms 51025. 1439. 35.5 7.03e-244
4 Bedrooms 17908. 1419. 12.6 6.24e- 36
5 Balcony 2688. 571. 4.71 2.58e- 6
Code
glance(TM4)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.523 0.523 55147. 1271. 0 4 -57264. 114541. 114580.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Model 4 (TM4):
- Coefficients:
  - The coefficient for parking is -7.73. Each additional parking space is associated with a decrease in price of 7.73 euros.
  - The coefficient for Bathrooms is 51,025.25. Each additional bathroom is associated with an increase in price of 51,025.25 euros.
  - The coefficient for Bedrooms is 17,908.05. Each additional bedroom is associated with an increase in price of 17,908.05 euros.
  - The coefficient for Balcony is 2,687.60. Each additional balcony is associated with an increase in price of 2,687.60 euros.
- Standard Errors: all the standard errors seem reasonable, suggesting that the coefficient estimates are likely reliable.
  - The standard error for parking is 21.48.
  - The standard error for Bathrooms is 1439.08.
  - The standard error for Bedrooms is 1418.87.
  - The standard error for Balcony is 570.87.
- Significance:
  - The parking variable has a p-value greater than 0.05, suggesting it might not be a statistically significant predictor.
  - All other variables have very low p-values (close to zero), indicating that they are statistically significant predictors of the price.
- Goodness of Fit: The R-squared value of 0.5231 indicates that the model explains approximately 52.31% of the variability in the response variable (price). This suggests a moderate fit.
In summary, this model suggests that Bathrooms, Bedrooms, and Balcony are statistically significant predictors of housing price. However, parking does not appear to be a significant predictor in this model.
4.6 Model with Pre-processing Steps: PCA or Log-Transformation
From my perspective, I want to develop models with both pre-processing steps (PCA and log-transformation) to see the differences between them.
Code
#names(train_data) # get a list of variable names to speed this up
TM3ln_Model <- recipe(price_eur_log ~ ., # we need to select all variables first for pre-processing to work
data = train_data) %>% # same as above
step_log(area_sqm, offset = 1, base = 10) %>% # log-transform predictor
step_log(Bathrooms, offset = 1, base = 10) %>%
step_log(Bedrooms, offset = 1, base = 10) %>%
step_log(Balcony, offset = 1, base = 10) %>%
step_rm("price_eur", # remove the old response / dependent variable
"X", "price", "Address", "area", "latitude", "longitude", # remove all predictors not to be included
"Status", "neworold", "parking", "Furnished_status", "Lift",
"Landmarks", "type_of_building", "desc", "Price_sqft", "price_per_sqm",
"X_log","price_log", "area_log", "Bedrooms_log", "Bathrooms_log", "Balcony_log",
"parking_log", "Lift_log", "Price_sqft_log", "area_sqm_log", "price_per_sqm_log")
summary(TM3ln_Model)
# A tibble: 33 × 4
variable type role source
<chr> <list> <chr> <chr>
1 X <chr [2]> predictor original
2 price <chr [2]> predictor original
3 Address <chr [3]> predictor original
4 area <chr [2]> predictor original
5 latitude <chr [2]> predictor original
6 longitude <chr [2]> predictor original
7 Bedrooms <chr [2]> predictor original
8 Bathrooms <chr [2]> predictor original
9 Balcony <chr [2]> predictor original
10 Status <chr [3]> predictor original
# ℹ 23 more rows
Code
TM3ln_Model_prep <- prep(TM3ln_Model, training = train_data)
PCA_Model <- recipe(price_eur ~., data = train_data) |> # Full Model
step_rm(price_eur_log, # remove alternative response
"X", "price", "Address", "area", "latitude", "longitude", # remove the factors & other predictors
"Status", "neworold", "Furnished_status", "Landmarks",
"type_of_building", "desc", "Price_sqft", "price_per_sqm","X_log",
"price_log", "area_log", "Bedrooms_log", "Bathrooms_log", "Balcony_log",
"parking_log", "Lift_log", "Price_sqft_log", "area_sqm_log", "price_per_sqm_log") |>
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_pca(c("area_sqm", "Bedrooms", "Bathrooms"), num_comp = 1, prefix = "PC1main") %>%
step_pca(c("Balcony", "parking", "Lift"), num_comp = 1, prefix = "PC2additional")
summary(PCA_Model)
# A tibble: 33 × 4
variable type role source
<chr> <list> <chr> <chr>
1 X <chr [2]> predictor original
2 price <chr [2]> predictor original
3 Address <chr [3]> predictor original
4 area <chr [2]> predictor original
5 latitude <chr [2]> predictor original
6 longitude <chr [2]> predictor original
7 Bedrooms <chr [2]> predictor original
8 Bathrooms <chr [2]> predictor original
9 Balcony <chr [2]> predictor original
10 Status <chr [3]> predictor original
# ℹ 23 more rows
Code
# Execute the preprocessing (optional)
PCA_Model_prep <- prep(PCA_Model, training = train_data)

Considering the characteristics of the original dataset, I prefer employing log-transformation as a pre-processing step for the model. This choice suits the observed skewed distributions in both the response and the predictor variables: log-transformation helps create more symmetric distributions, thereby enhancing model performance. Principal Component Analysis (PCA) is a suitable approach when dealing with high-dimensional data, but since the current models do not involve a large number of predictors, the need for dimensionality reduction is not prominent in this case.
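As a sanity check on the PCA intuition: when grouped predictors are strongly correlated, as the size-related variables here are, the first principal component absorbs most of their shared variance. A simulated sketch with prcomp (simulated values, not the delhi columns):

```r
set.seed(42)
# Three correlated size-like predictors, as in the PC1main group
area      <- rnorm(500, mean = 100, sd = 30)
bedrooms  <- area / 30 + rnorm(500, sd = 0.5)
bathrooms <- area / 50 + rnorm(500, sd = 0.5)

pca <- prcomp(cbind(area, bedrooms, bathrooms), center = TRUE, scale. = TRUE)
prop_pc1 <- summary(pca)$importance["Proportion of Variance", "PC1"]
prop_pc1  # well above 1/3: PC1 captures most of the shared variance
```

With only a handful of predictors, though, the dimensionality-reduction benefit is small, which supports the preference for the log-transform recipe.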
4.7 Training the Models and Getting the Training Errors
Code
TM3_ols_wf <- workflow() %>%
add_recipe(TM3_Model) %>% # the recipe holds the model formula and pre-processing
add_model(Regression_OLS) # the model spec holds the algorithm
TM3ln_ols_wf <- workflow() %>%
add_recipe(TM3ln_Model) %>%
add_model(Regression_OLS)
PCA_ols_wf <- workflow() %>%
add_recipe(PCA_Model) %>%
add_model(Regression_OLS)

4.8 Training the Models
Code
TM3_ols_fit <- fit(TM3_ols_wf, data = train_data)
TM3ln_ols_fit <- fit(TM3ln_ols_wf, data = train_data)
PCA_ols_fit <- fit(PCA_ols_wf, data = train_data)

4.9 An Overview of the Model Results
4.9.1 TM3 Model Result
Code
# TM3_ols_fit
rmse(as.data.frame(cbind(train_data$price_eur, TM3_ols_fit$fit$fit$fit$fitted.values)), V1, V2)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 41462.
4.9.2 TM3ln Model Result
Code
# TM3ln Model on original price
rmse(as.data.frame(cbind(train_data$price_eur,
exp(TM3ln_ols_fit$fit$fit$fit$fitted.values))), V1, V2)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 44242.
4.9.3 PCA Model Result
Code
# PCA_ols_fit
rmse(as.data.frame(cbind(train_data$price_eur, PCA_ols_fit$fit$fit$fit$fitted.values)), V1, V2)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 48026.
4.9.4 Coefficient and Overall Model Output
The coefficients provide insights into how each predictor influences the model which are summarized as follows:
|                | (1) TM3       | (2) TM3ln | (3) PCA      |
|----------------|---------------|-----------|--------------|
| (Intercept)    | −50896.544*** | 5.290***  | 92224.526*** |
|                | (2308.239)    | (0.063)   | (705.128)    |
| area_sqm       | 920.563***    | 2.441***  |              |
|                | (15.434)      | (0.046)   |              |
| Bathrooms      | 10237.589***  | 0.917***  |              |
|                | (1280.451)    | (0.091)   |              |
| Bedrooms       | −578.095      | 0.601***  |              |
|                | (1111.340)    | (0.086)   |              |
| Balcony        | −939.479*     | 0.000     |              |
|                | (433.446)     | (0.027)   |              |
| PC1main1       |               |           | 39923.851*** |
|                |               |           | (451.849)    |
| PC2additional1 |               |           | 3150.901***  |
|                |               |           | (622.467)    |
| Num.Obs.       | 4642          | 4642      | 4642         |
| R2             | 0.730         | 0.724     | 0.638        |
| BIC            | 111936.5      | 2803.5    | 113284.1     |

Standard errors are shown in parentheses below each estimate.
An overview of the model results indicates that:
TM3 model has the lowest RMSE, indicating better predictive performance.
The R-squared values suggest that TM3 model has the highest goodness of fit among the models.
In conclusion, based on the RMSE and R-squared values, the TM3 Model appears to be the best-performing model for predicting housing prices in Delhi.
(to be continued)